Coprocessor circuit architecture, for instance for digital encoding applications

ABSTRACT

A coprocessor circuit for processing image data in digital form has a motion vector controller block for generating, starting from said image data, motion vector values. Such vector values include predictor data and macroblock data relating to a current macroblock of said image data to be estimated, the predictor data and macroblock data being adapted to be stored at respective memory addresses. An address generator block is provided for extracting said respective addresses from said motion vector values. Also provided are a predictor fetch block for retrieving said predictor data based on the respective addresses extracted by said address generator block, a current macroblock fetch and distengine block for retrieving said macroblock data based on the respective addresses extracted by said address generator block and for processing said macroblock data according to a given function, and a decision block for collecting said retrieved data as partial results and selecting the best result therefrom.

TECHNICAL FIELD

[0001] The present invention relates to circuit architectures, and was developed with a view to possible use in digital encoding applications.

BACKGROUND OF THE INVENTION

[0002] Several digital video encoding standards have been developed in recent years, but the most important for the present and the foreseeable future are:

[0003] MPEG-2 for television-like resolutions and high bit rates (greater than 1.5 Mbits/s), for digital video cameras and DVD recordable applications

[0004] MPEG-4 or H263 for video telephony (especially for wireless mobile terminals), for lower resolutions (e.g., QCIF, 176 by 144 pixels) and lower bit rates (less than 1 Mbits/s)

[0005] While the following explanation will be provided by primarily referring to MPEG-2, the same points apply in principle to the other standards listed, as can be gathered, e.g., from the ISO/IEC 13818-2 MPEG-2 and ISO/IEC 14496-2 MPEG-4 video coding standards.

[0006] The encoding process is based on several tasks in cascade, of which motion estimation is by far the most expensive computationally. The standard defines the output of the estimation block (a motion vector and the prediction error), but leaves freedom on how this estimation is done, so that encoder providers can use a preferred estimation technique and implementation to add value to their box (lower cost, higher picture quality). After motion estimation, a set of decisions has to be taken on how to encode each MB (MacroBlock, the “quantum” or basic building block into which every picture is decomposed for motion estimation). The predictor itself (i.e., the macroblock that the estimation process has found to best match the one currently being processed) must also be provided to the rest of the encoder chain.

[0007] All these operations require so much computational power that it is impractical to implement them even on very high performance CPUs/DSPs without heavily compromising the overall picture quality of the encoded bitstream. On the other hand, to support different standards and to allow the motion estimation algorithm to be tweaked, means are required that can be programmed or even re-programmed in the field, for example by downloading the new version of the algorithm onto the terminal over the air. The motion estimation algorithm is not fixed by the standards, and it is crucial in giving the overall encoder a competitive performance advantage. A better version of the motion estimation algorithm can thus result in increased perceived performance of the overall encoder.

[0008] Another key aspect of the motion estimation task is its memory bandwidth requirement. As an extensive search for the best match must be performed within very large search windows, all the algorithms tend to consume a large amount of system memory bandwidth. Typical bandwidth (B/W) figures for this task are in excess of 100 MB/s. This has two main drawbacks: expensive high-speed and/or wide-wordlength memory devices are required, and power consumption is increased, as higher external I/O activity means more power wasted on the device's heavily (capacitively) loaded external pins.

[0009] These reasons lead to the need for a motion estimation algorithm that has a low cost (low computational complexity) yet a high performance in terms of subjective picture quality, and for a motion estimation engine that is equally cost effective (low area), flexible (SW programmable), low bandwidth and low power, as most of the applications target battery-powered mobile terminals (cameras, cellular phones).

[0010] Examples of prior attempts by others are described, e.g., in the following documents: EP-A-0 895 423, EP-A-0 895 426, EP-A-0 893 924, EP-A-0 831 642, U.S. Pat. No. 5,936,672 and U.S. Pat. No. 5,987,178.

[0011] Once the key characteristics of a motion estimator engine are identified, architectural solutions that can achieve those goals must be found. The required features are low cost (i.e., low area), low bandwidth, low power and high flexibility.

SUMMARY OF THE INVENTION

[0012] In one preferred embodiment, the invention provides a SLIMPEG Hardware Engine (SHE) motion estimator coprocessor for digital video encoding applications. The approach that has been followed for its architecture is to provide as much flexibility as possible in terms of algorithms and encoding standards supported, whilst keeping a very cost-effective and power-friendly implementation. The same area size and power consumption characteristics of a hardwired implementation are provided, while keeping all the flexibility of a software implementation.

[0013] The engine is composed of a novel low-cost small-area pipeline, a cache-based internal storage for the search window pixels yielding B/W and memory size savings versus a conventional approach, and a DSP microcontroller to achieve software flexibility. This architecture is helpful for low-cost and low-power implementations such as digital video cameras or 3G wireless terminals incorporating video transmission capabilities.

[0014] Being a microcoded engine, the solution of the invention can run different motion estimation algorithms (provided they do not require more than the SHE intrinsic computational power), although SHE has been specifically designed to support the SLIMPEG recursive motion estimation algorithm, in all its versions and variants as described in various publications; see, for example, European Patent applications 97 830 605.8, 98 830 163.6, 97 830 591.0, 98 830 689.0, 98 830 600.7 and 98 830 484.6.

[0015] The solution of the invention is adapted to support different digital video encoding standards, including MPEG-2, MPEG-4 and H263.

[0016] In a traditional approach to motion estimation, the algorithm searches for the best match inside a predefined search window. To decrease memory bandwidth, the engine usually has a built-in local memory to buffer the entire search window. This leads to a substantial amount of memory being required, in the range of 40 KBytes for typical PAL frame search windows (+/−120 horizontally, +/−72 vertically). As the motion estimator moves on to subsequent macroblocks, it must update the local search window to follow the current macroblock. This update still takes a substantial amount of bandwidth, typically in excess of 100 MB/s.
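The 40 KBytes figure can be checked with a short calculation. The sketch below assumes that the window is the +/− range in integer pixels extended by one 16 by 16 macroblock, with one byte per luma pixel; the function name is illustrative only.

```python
MB_SIZE = 16  # macroblock edge in pixels


def search_window_bytes(range_x, range_y, mb_size=MB_SIZE, bytes_per_pixel=1):
    """Bytes needed to buffer a full search window of +/-range_x by
    +/-range_y pixels around one macroblock (luma only, 8-bit pixels)."""
    width = 2 * range_x + mb_size    # 2 * 120 + 16 = 256 pixels
    height = 2 * range_y + mb_size   # 2 * 72 + 16 = 160 pixels
    return width * height * bytes_per_pixel


window = search_window_bytes(120, 72)
print(window, window / 1024)  # 40960 bytes, i.e. 40 KBytes
```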

[0017] In a preferred embodiment of the invention a different architectural approach is used for search window buffering: the internal memory is managed as a CPU cache, loading the search window pixels only when they are really needed and buffering them in the dynamically allocated internal memory. Due to their statistical averaging nature, caches are not generally deemed safe for real-time operation. For this reason, a bus access limiter (briefly called a “bandwidth cap”) has been coupled to the cache refill engine. This device monitors and influences bus accesses, effectively clipping the sporadic high-bandwidth peaks that could occur in particularly stressing macroblocks, to ensure that the real-time B/W budget is never exceeded. This is enforced on a macroblock by macroblock basis, thus ensuring very fine-grained control of B/W. The maximum allowed B/W value can be dynamically changed, based on system configuration or working conditions (e.g., battery level: lower B/W means lower power consumption).

[0018] To perform motion estimation, means are required to gauge whether one predictor is better than another; a usual cost function for that is to take each pair of corresponding pixels, compute their absolute difference and accumulate it over the whole macroblock. This pixel comparison is called Sum of Absolute Differences (SAD). The overall macroblock figure is instead called Mean Absolute Error (MAE). A hardware block is thus required to perform SAD operations efficiently.
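As an illustration only, the cost function can be sketched in software as follows. The accumulated value is not divided by the pixel count here (an assumption: only the ranking of candidates matters), and the function names are hypothetical.

```python
def sad_16x16(current, predictor):
    """Sum of absolute differences between two 16x16 blocks, each given
    as a list of 16 rows of 16 pixel values (0..255)."""
    total = 0
    for cur_row, pred_row in zip(current, predictor):
        for c, p in zip(cur_row, pred_row):
            total += abs(c - p)
    return total


def best_predictor(current, candidates):
    """Return (index, cost) of the candidate with the lowest accumulated error."""
    costs = [sad_16x16(current, cand) for cand in candidates]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs[best]
```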

[0019] Conventional implementations of this function are via systolic array engines, i.e., arrays of 16 by 16 (=256) SAD processing elements, computing one MAE figure each clock cycle. These blocks are characterized by very fast computation speed, but also by relatively high complexity, as they use a lot of processing elements (PE) and they must gather and move all the data and partial results to keep the engine going.

[0020] SLIMPEG features can once again be exploited to decrease complexity. This means that one only needs a one-dimensional array of 16 SAD elements. This can be called a “distengine”, as the MAE is also known in technical literature as “level 1 distance”. A solution can thus be selected that is 16 times less complex in principle (16×1 vs. 16×16 SAD elements).
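A behavioural sketch of such a one-dimensional engine is given below. It assumes that the 16 elements consume one 16-pixel row per step, so that a macroblock takes 16 steps; the text does not state the actual cycle count, and the class and method names are illustrative.

```python
class Distengine:
    """Model of 16 absolute-difference elements working on one row per step."""

    WIDTH = 16

    def __init__(self):
        self.acc = 0     # accumulated error for the macroblock
        self.steps = 0   # number of rows processed so far

    def step(self, cur_row, pred_row):
        assert len(cur_row) == len(pred_row) == self.WIDTH
        # The 16 elements work in parallel in hardware; here they are summed.
        self.acc += sum(abs(c - p) for c, p in zip(cur_row, pred_row))
        self.steps += 1


def macroblock_error(current, predictor):
    """Feed a 16x16 macroblock row by row; return (error, steps used)."""
    engine = Distengine()
    for cur_row, pred_row in zip(current, predictor):
        engine.step(cur_row, pred_row)
    return engine.acc, engine.steps
```

Under this assumption the engine trades a 16-fold reduction in processing elements for 16 steps per macroblock instead of one.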

[0021] The flexibility needed is therefore in motion vector selection, search window parameters, matching modes, coefficients, thresholds, matching block size, and so on. This can be achieved by a pipeline control that is not based on hardwired Finite State Machines but on microcode running on a dedicated controller/DSP.

[0022] All the algorithm characteristics are then not frozen in the silicon but reside in a flash memory, and can thus be easily reprogrammed, allowing maximum flexibility.

[0023] In the presently preferred embodiment, developed with respect to MPEG-2, the solution of the invention will support:

[0024] Frame pictures organization

[0025] Fully programmable motion estimation algorithm

[0026] Frame and field prediction modes (four field modes: Top/Bottom on Top/Bottom)

[0027] Programmable GOP M=1, 2, 3, any N value (but N must be a multiple of M by MPEG-2 standard)

[0028] B picture support for M>1 (backward, forward, interpolated mode)

[0029] Dual prime prediction for M=1

[0030] Half pixel accuracy during the whole estimation process

[0031] Prediction based on 16 by 16 pixels macroblocks

[0032] Unlimited telescopic search windows (up to the maximum size allowed by MPEG-2 MP@ML, 1023.5 by 127.5)

[0033] Luma prediction error for winning predictor DMA-ed to external buffer/memory. Alternatively (programmable), predictor and current macroblock can be output.

[0034] Intra/not intra coding decision

[0035] MC/not MC coding decision

[0036] DCT type (frame or field mode) coding decision

[0037] Activity index computation

[0038] Scene change detection

[0039] Inverse 3/2 pull down detection

[0040] Interlaced or progressive picture content detection

[0041] Concealment motion vectors for I pictures

[0042] Automatic f_code decision at frame level.

[0043] Programmable bandwidth cap (bus accesses limiter)

[0044] DMA gathering and delivery to external buffer of chroma prediction error (Optional)

[0045] Motion compensated noise level estimation and reduction on luma component (Optional)

[0046] In the foregoing, “Optional” means that hardware means could be built in to support the feature. If the feature is not needed, the relevant hardware will not be present in the device.

[0047] In the presently preferred embodiment the solution of the invention will take as input the source original or reconstructed images. In particular, SLIMPEG coarse search will always be performed on the original prediction pictures, whilst fine search will always be performed on the reconstructed anchor frames. Of course, during motion estimation only the luma component of the images will be used.

[0048] Images will always be stored as frames, even if they come from interlaced sources. Pixels will be 8-bit unsigned integer quantities. Prediction error pixels will be 16-bit signed integer quantities.

[0049] Images in memory are always assumed to be in macroblock (or block) tiled format. That is, all the pixels of a (macro)block will reside in consecutive addresses of memory, to optimize cache refill accesses. Inside each (macro)block, scan order will be from top to bottom and from left to right (lexicographical order).

[0050] The source images can be independently pre-processed for format conversion and/or noise reduction. Alternatively, motion compensated temporal noise reduction means (for luma) can be added to the SHE. The results of motion estimation, prediction error computation and decision process will be:

[0051] Motion vectors: these will be in X and Y relative position, with half pixel accuracy (i.e., a value of 1 means a 0.5 pixel displacement). Signed 16-bit values will be used for each field. These motion vectors will then be re-used for recursive estimations according to the SLIMPEG algorithm. Both coarse and fine search vectors will be available in external memory, although only fine search vectors will be used for bitstream creation. Coarse search vectors can be used for ancillary algorithms.

[0052] Luma prediction error (alternatively, luma predictor and current macroblock): these will be DMA-ed into an intermediate buffer, to be read by the loop encoder. In case the prediction error is required, it will be in signed 16-bit values, requiring a total storage area of 512 bytes per luma macroblock. In case separate current and prediction MBs need to be output, the same 512 bytes area will be used as unsigned 8-bit values to hold the current MB in the first half and the predictor in the second half of the buffer.

[0053] Optionally, the same output can be provided for the chroma components of the frame. In this case, one 256 bytes area is required (4:2:0 format). U and V components will be stored sequentially.

[0054] Decision results, in the form of a set of flags and an activity coefficient.
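The two uses of the 512-byte luma output area described above can be sketched as follows. The byte order (little-endian) and the sign convention (current minus predictor) are assumptions made for illustration; the text specifies only the sizes and value widths.

```python
import struct

MB_PIXELS = 16 * 16  # luma pixels per macroblock


def pack_prediction_error(current, predictor):
    """current, predictor: flat sequences of 256 8-bit luma pixels.
    Returns 256 signed 16-bit prediction errors (512 bytes)."""
    errors = [c - p for c, p in zip(current, predictor)]  # assumed sign convention
    return struct.pack("<%dh" % MB_PIXELS, *errors)       # assumed little-endian


def pack_current_and_predictor(current, predictor):
    """Current MB in the first half, predictor in the second half (512 bytes)."""
    return bytes(current) + bytes(predictor)
```

Both layouts fill exactly the same 512-byte area, which is why a single buffer can serve either output mode.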

[0055] The arrangement of the invention lends itself ideally to being incorporated in the form of an integrated circuit (IC), preferably of the monolithic (single-chip) type.

BRIEF DESCRIPTION OF THE DRAWINGS

[0056] A preferred embodiment of the invention will now be described, by way of non-limiting example only, with reference to the enclosed figures of the drawing, wherein:

[0057] FIG. 1 represents the overall architecture of a hardware engine according to the invention,

[0058] FIG. 2 and FIG. 3 show coarse and fine search overlap in the circuit of the invention,

[0059] FIG. 4 shows coarse/fine prediction frames overlap,

[0060] FIG. 5 shows a typical MPEG-2 front end processing flow,

[0061] FIG. 6, including three portions designated a), b) and c), shows exemplary motion vector (MV) management in the circuit of the invention,

[0062] FIG. 7 and FIG. 8 show motion vector fields usage, for coarse and fine search fields, respectively,

[0063] FIG. 9 shows the address generator (AG) function of the circuit of the invention,

[0064] FIG. 10 and FIG. 11, the latter including two portions designated a) and b), show predictor fetch (PF) and block cache management in the circuit of the invention,

[0065] FIG. 12 shows a cached search window with bandwidth cap,

[0066] FIG. 13 shows predictor alignment (PA) interpolation blocks,

[0067] FIG. 14 shows a so-called distengine implementation within the framework of the invention, and

[0068] FIG. 15 shows an example of pipeline data flow in the circuit of the invention.

DETAILED DESCRIPTION OF THE INVENTION

[0069] In the annexed drawing, FIG. 1 shows a presently preferred embodiment of the SLIMPEG Hardware Engine (SHE) circuit architecture of the invention.

[0070] The engine includes a Motion Vectors (MV) generation controller 10, a matching error computing pipeline 11 (pipeline flow is from left to right in the drawing), a local cached memory 12 and a BUS interface 13. Each pipeline stage is not a straight combinatorial one as in general purpose CPUs, but is actually a multi-cycle elaboration block. This means that each stage might have multi-cycle inputs (i.e., will require inputs for two or more consecutive cycles), multi-cycle elaboration (i.e., the input→output delay will be more than one cycle) and multi-cycle output (i.e., the output will last for more than one cycle). This is explained in more detail in the following in connection with FIG. 15.

[0071] SLIMPEG is based on two distinct estimation steps for each picture, the coarse search and the fine search. Because of real-time implementation constraints, these will operate in parallel on different macroblocks, time-sharing the HW resources of the SHE. In each macroblock period SHE will generate the result of the coarse search for one macroblock, and the result of the fine search for another one. This overlapping is shown in FIGS. 2, 3 and 4.

[0072] Specifically, FIG. 2 shows that both coarse and fine search functions use the same hardware resources in time division.

[0073] Inside the engine, operation is directed by the MV Generator controller (MVG), which is in charge of selecting the motion vector to test according to the SLIMPEG algorithm and of keeping track of the time used for each macroblock to correctly synchronize its input/output operations. With its spare processing power, it runs ancillary algorithms like scene change detection, inverse 3/2 pulldown and so on. The MVG will then generate MV coordinates and control words to instruct the pipeline on exactly how to use the motion vectors.

[0074] The address generator (AG) 101 will then translate the motion vector's XY displacements into the physical addresses of blocks in memory, to be used by the predictor fetch (PF) stage 102. The prediction pixels extracted are then aligned and (if appropriate) interpolated by the Predictor Alignment (PA) 103, and then fed to the Current MB Fetch and Distengine (CFD) 104, which fetches the current macroblock under prediction and computes the mean absolute error (MAE) of the prediction. The decision block 105 will gather all the MAEs and decide which is the best prediction. After that, the intra/not intra, MC/not MC and DCT type coding decisions and the activity index are computed on the winning predictor, then DMA-ed to the loop encoder together with the prediction error. The winning motion vectors will be fed back to the MVG as needed by the SLIMPEG algorithm.

[0075] Optionally, the SHE could also support DMA fetch and prediction error composition for the chroma part of the image. In that case, a dedicated block inside SHE attached to the decision stage will take care of that.

[0076] Also optionally, temporal noise reduction means could be attached at the output of the decision block to noise-reduce the source images. This block will perform motion compensated noise level detection and reduction based on the motion vectors resulting from the coarse search. The coarse search current macroblock and its predictor will form the input of this block, which will output a noise-reduced version of the current macroblock that will overwrite the noise corrupted one.

[0077] FIG. 5 shows a functional diagram of a typical MPEG-2 front end part when SLIMPEG and SHE are used to implement it.

[0078] Input frames will be stored in main memory from the video input device. For the sake of simplicity these images are assumed to be already in the correct format and scan order needed for processing (e.g., D1 4:2:0 format and MB tiled scan in memory). An incoming image will first be read by the coarse search process to be the object of estimation. As this proceeds, prediction blocks will be fetched as the block cache generates misses. For each of the current image macroblocks, a coarse motion vector and prediction error will be computed. The MV will be stored in the MV field in main memory (not shown in the figure) to form the basis for the fine search on the same image and for the coarse search of the next image. The MV (if needed), the current MB and the prediction MB will also be output to the MCNR block, which will cancel (most of) the noise carried in the current MB, enhancing picture quality and compression efficiency. This filtered macroblock will overwrite the original one, and therefore a noise reduced version of the source image will form in memory. This noise reduced version will be used as the current frame for fine search estimation. The prediction frames used will be the noise reduced anchor frames, coded and reconstructed.

[0079] Meanwhile, fine search will run concurrently. For B pictures, this will be running on different pictures (i.e., while coarse search estimates picture N, fine search will estimate picture N-3 in temporal source order). Therefore those will be two completely independent processes. During P estimation, however, coarse and fine search will operate on the same picture, with a delay of just a few macroblocks. It is therefore necessary to take care that the noise-reduced version of the source picture is used as the current MB. This is done by forwarding the result of the MCNR to the fine search process. In actual hardware, this results in just a macroblock buffer, as coarse search, MCNR and fine search will run on the same SHE engine. Moreover, this will save 20 MB/s, as the write and reload operations are in this case redundant. As usual, fine search will fetch the prediction blocks needed from the anchor frames, and will produce a best predictor, along with all the decisions taken for that macroblock. These will be given to the loop encoder, to continue the processing chain.

[0080] The MVG 10 is the controlling block of the coprocessor, being responsible for generating the test motion vectors with the appropriate control words. It will also be responsible for the overall timing of the engine, in order to synchronize SHE inputs and outputs with the appropriate time slots. Besides these main features, its spare processing power is used to compute the “encoding enhancing” ancillary algorithms such as scene cut detection, inverse 3/2 pulldown, interlace/progressive content detection and f_code adaptation. All these algorithms are based on indexes computed starting from the SLIMPEG coarse prediction motion vector field, and thus have low complexity.

[0081] The MVG has a built-in counter that allows it to keep count of the cycles spent estimating the current macroblock. Normally, each macroblock estimation will take less than 24.7 μs (the macroblock source period) to complete, so SHE could run ahead of the video input device. This is avoided by this control, which will keep SHE in sync with the input, inserting stall or power-down cycles (or, alternatively, additional motion vector tests) to wait for the input. In the same manner, in some worst cases, memory bus traffic could cause SHE to stall for too many cycles, causing it to exceed the macroblock period. When this happens, the rendezvous with the loop encoder could be missed. In this case, similarly, the timer function can force the estimation to finish in order to give the result to the loop encoder.
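The macroblock source period quoted above follows from the picture size and rate. The sketch below assumes a PAL picture of 1620 macroblocks at 25 pictures per second; the `timer_action` helper and its return values are hypothetical names used only to illustrate the stall and forced-finish behaviour.

```python
def macroblock_period_us(mbs_per_picture, pictures_per_second):
    """Time available per macroblock, in microseconds."""
    return 1e6 / (mbs_per_picture * pictures_per_second)


def timer_action(elapsed_us, estimation_done, period_us):
    """Decide what the timing control does for the current macroblock."""
    if estimation_done:
        # Finished early: wait for the input instead of running ahead.
        return "stall" if elapsed_us < period_us else "deliver"
    # Not finished: keep going until the period expires, then cut short.
    return "force_finish" if elapsed_us >= period_us else "continue"


pal_period = macroblock_period_us(1620, 25)
print(round(pal_period, 1))  # 24.7
```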

[0082] All these functions are preferably microcoded to allow upgrades and feature changes. Therefore, the MVG is a microcontroller or DSP device rather than a hardwired FSM. To achieve maximum optimization, it is possible to design a custom microcontroller, with a custom ISA and implementation. The choice of which DSP to use is based on its ability to support the required tasks and on its availability. The D950 DSP manufactured by STMicroelectronics is a preferred choice for that purpose.

[0083] Because of the recursive nature of SLIMPEG, a buffering circuit is needed in order to be able to re-use the generated motion vectors. Buffers are required in the main memory as well as on board the MVG. The latter will be simple FIFOs or circular buffers, which can be implemented in the X or Y memory of the D950 DSP.

[0084] As for the size and quantity, several “slices” of vectors in the D950 local memory and MV fields in main memory are required. A slice is a horizontal line of 45 MBs; a slice of vectors is therefore composed of the 45 MVs associated with those macroblocks, but FIFOs of 46 or 47 MVs are actually used, as described later. Each slice will then require 184 or 188 bytes, as each MV will use a 32-bit word. Each “MV field” will be the collection of the 1620 (PAL) or 1350 (NTSC) MVs associated with the macroblocks of a picture. This means 6480 (PAL) or 5400 (NTSC) bytes for each MV field.
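The sizes given above can be reproduced with a few lines of arithmetic. The picture heights of 36 (PAL) and 30 (NTSC) macroblock rows are derived here from the stated totals; the function names are illustrative.

```python
MV_BYTES = 4       # each MV uses one 32-bit word
MBS_PER_SLICE = 45  # macroblocks in one horizontal line


def slice_bytes(n_mvs):
    """Storage for one slice FIFO holding n_mvs motion vectors."""
    return n_mvs * MV_BYTES


def mv_field_bytes(mb_rows):
    """Storage for one MV field covering a whole picture."""
    return MBS_PER_SLICE * mb_rows * MV_BYTES


print(slice_bytes(46), slice_bytes(47))        # 184 188
print(mv_field_bytes(36), mv_field_bytes(30))  # 6480 5400
```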

[0085] Operation of the slice MV FIFOs and MV fields is as depicted in FIGS. 6, 7 and 8.

[0086] The following MV fields are needed in the memory (M<=3 operation):

[0087] 2 previous coarse search + 1 current coarse search = 19440 Bytes (max, for PAL). A fourth MV field is not needed because the P picture MV field can be discarded as soon as estimation thereof is finished:

Coarse:     I0   B1   B2      P3          B4          B5          P6
Fine:       -    -    -       P3          B1          B2          P6
MV fields:  I0   B1   B1, B2  B1, B2, P3  B1, B2, B4  B2, B4, B5  B4, B5, P6

[0088] No MV field is needed for the fine search, as all the information needed is kept in the on-board FIFOs and then discarded.

[0089] Normally, the SLIMPEG algorithm will need the MVs of the macroblocks around the one under prediction. These can be kept in slice FIFOs. The slice FIFOs can be divided into two types. The first type, “spatial” FIFOs, contain MVs resulting from previous estimations of MBs in the same frame. More precisely, they will contain the result of the estimations of the last 46 macroblocks. The input of these FIFOs will come from the Decision stage, in the form of the last MV winner for the prediction/search mode to which the FIFO is devoted. The MVs coming out of these FIFOs will be either stored in the Coarse MV field in main memory in case of coarse search, or dropped in case of fine search.

[0090] The second type will be “temporal” FIFOs, which will contain results from estimations of MBs in previous pictures or previous passes of prediction. Each of these FIFOs will contain 47 MVs. These MVs will be loaded from the Coarse MV fields in the main memory. In case of coarse search, the vectors will come from the coarse MV field of the previous (in input order) frame. In case of fine search, the vectors will be the ones computed in the coarse pass of the same picture. The MVs coming out of these FIFOs will always be dropped.

[0091] The following on-board MV slices will be needed:

[0092] 5 fine search “spatial” MV slices for forward prediction (1 frame, 4 field modes)

[0093] 5 fine search “spatial” MV slices for backward prediction (1 frame, 4 field modes)

[0094] 1 fine search “temporal” MV slice

[0095] 1 coarse search “spatial” MV slice

[0096] 1 coarse search “temporal” MV slice

[0097] The total amount is 2400 Bytes.
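The 2400-byte total, as well as the 19440 bytes needed for three PAL MV fields in main memory, can be checked as follows. The sketch takes the FIFO depths from the text (46 MVs for “spatial” slices, 47 for “temporal” ones, 32 bits per MV); the dictionary and function names are illustrative.

```python
MV_BYTES = 4       # one 32-bit word per motion vector
SPATIAL_MVS = 46   # depth of a "spatial" slice FIFO
TEMPORAL_MVS = 47  # depth of a "temporal" slice FIFO

# (number of slices, MVs per slice) for each on-board slice group
ONBOARD_SLICES = {
    "fine spatial forward": (5, SPATIAL_MVS),
    "fine spatial backward": (5, SPATIAL_MVS),
    "fine temporal": (1, TEMPORAL_MVS),
    "coarse spatial": (1, SPATIAL_MVS),
    "coarse temporal": (1, TEMPORAL_MVS),
}


def onboard_bytes(groups):
    """Total XY memory needed for all slice FIFOs."""
    return sum(count * depth * MV_BYTES for count, depth in groups.values())


print(onboard_bytes(ONBOARD_SLICES))  # 11 * 184 + 2 * 188 = 2400
print(3 * 1620 * MV_BYTES)            # three PAL MV fields = 19440
```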

[0098] As these FIFOs are SW operated by a D950 DSP, what is needed is the actual space in XY memory; FIFO management will be done by the D950. Also note that even if some versions of SLIMPEG might not use all the information stored in all the slice FIFOs (e.g., v5.2 uses only the T0 and T1 temporal MVs for both fine and coarse passes), these FIFOs are kept in the specifications to allow more freedom in algorithmic enhancement.

[0099] With all these mechanisms in place, the MVG will be able to correctly generate the MVs to test. The output of the MVG will therefore be:

[0100] pred_pos(15:0) X HALF PIXEL absolute predictor position (unsigned)

[0101] pred_pos(31:16) Y HALF PIXEL absolute predictor position (unsigned)

[0102] mv(15:0) X HALF PIXEL Motion Vector COORD (modulo-2 signed)

[0103] mv(31:16) Y HALF PIXEL Motion Vector COORD (modulo-2 signed)

[0104] Note that pred_pos=current_mb_pos+mv;
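The packing of these two 32-bit words can be illustrated as below. The sketch reads “modulo-2 signed” as two's complement, which is an assumption, and the helper names are hypothetical.

```python
def pack_xy(x, y):
    """Pack two 16-bit fields into a 32-bit word: X in bits 15:0, Y in bits 31:16."""
    return ((y & 0xFFFF) << 16) | (x & 0xFFFF)


def unpack_signed_xy(word):
    """Recover signed (x, y), assuming two's complement 16-bit fields."""
    def s16(v):
        return v - 0x10000 if v & 0x8000 else v
    return s16(word & 0xFFFF), s16((word >> 16) & 0xFFFF)


def pred_pos(current_mb_pos, mv):
    """pred_pos = current_mb_pos + mv, all in half-pixel units as (x, y)."""
    return current_mb_pos[0] + mv[0], current_mb_pos[1] + mv[1]


mv_word = pack_xy(-3, 5)            # 1.5 pixels left, 2.5 pixels down
pos = pred_pos((64, 32), (-3, 5))   # current MB at pixel (32, 16)
print(hex(mv_word), pos)            # 0x5fffd (61, 37)
```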

[0105] MV control word: a 32-bit bit field, specifying how the related motion vector must be used. The control word layout will be as follows:

[0106] SEARCH STEP FLAGS (1:0)

[0107] 1: COARSE_STEP_FLAG

[0108] 0: FINE_STEP_FLAG

[0109] PREDICTION TYPE FLAGS (7:2)

[0110] 2: FRAME_PRED_FLAG

[0111] 3: FIELD_T_ON_T_FLAG

[0112] 4: FIELD_T_ON_B_FLAG

[0113] 5: FIELD_B_ON_T_FLAG

[0114] 6: FIELD_B_ON_B_FLAG

[0115] 7: DUAL_PRIME_PRED_FLAG

[0116] PICTURE TYPE FLAGS (11:8)

[0117] 8: I_PICT_FLAG

[0118] 9: P_PICT_FLAG

[0119] 10: B1_PICT_FLAG

[0120] 11: B2_PICT_FLAG

[0121] 12: RESERVED FOR FUTURE USE

[0122] PREDICTION DIRECTION FLAGS (15:13)

[0123] 13: FORWARD_FLAG

[0124] 14: BACKWARD_FLAG

[0125] 15: INTERPOLATED_FLAG

[0126] NEWS FLAGS (17:16)

[0127] 16: NEW_CURRENT_MB_FLAG

[0128] 17: NEW_CURRENT_FRAME_FLAG

[0129] VECTOR TYPE FLAGS (21:18)

[0130] 18: UPDATES_FLAG

[0131] 19: TEMP_SPAT_FLAG

[0132] 20: ZERO_MV_FLAG

[0133] 21: NULL_MV_FLAG

[0134] MISC FLAGS (26:22)

[0135] 22: MULTI_PREDICTION_FLAG

[0136] 23: MULTI_PREDICTION_LAST_FLAG

[0137] 24: RESERVED FOR FUTURE USE

[0138] 25: TAKE_DECISION_FLAG

[0139] 26: COARSE_OFF_FLAG

[0140] 31:27: NOT USED/RESERVED
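The layout above maps naturally onto bit masks. The following sketch lists the flag positions exactly as given in the text; the helper functions (`make_control_word`, `has_flag`, `field`) are illustrative names and not part of the described circuit.

```python
# Bit position of each flag in the 32-bit MV control word (from the layout above).
FLAG_BITS = {
    "FINE_STEP_FLAG": 0, "COARSE_STEP_FLAG": 1,
    "FRAME_PRED_FLAG": 2, "FIELD_T_ON_T_FLAG": 3, "FIELD_T_ON_B_FLAG": 4,
    "FIELD_B_ON_T_FLAG": 5, "FIELD_B_ON_B_FLAG": 6, "DUAL_PRIME_PRED_FLAG": 7,
    "I_PICT_FLAG": 8, "P_PICT_FLAG": 9, "B1_PICT_FLAG": 10, "B2_PICT_FLAG": 11,
    "FORWARD_FLAG": 13, "BACKWARD_FLAG": 14, "INTERPOLATED_FLAG": 15,
    "NEW_CURRENT_MB_FLAG": 16, "NEW_CURRENT_FRAME_FLAG": 17,
    "UPDATES_FLAG": 18, "TEMP_SPAT_FLAG": 19, "ZERO_MV_FLAG": 20,
    "NULL_MV_FLAG": 21,
    "MULTI_PREDICTION_FLAG": 22, "MULTI_PREDICTION_LAST_FLAG": 23,
    "TAKE_DECISION_FLAG": 25, "COARSE_OFF_FLAG": 26,
}


def make_control_word(*names):
    """Build a control word with the named flags set."""
    word = 0
    for name in names:
        word |= 1 << FLAG_BITS[name]
    return word


def has_flag(word, name):
    return bool((word >> FLAG_BITS[name]) & 1)


def field(word, hi, lo):
    """Extract bits hi:lo, e.g. field(word, 7, 2) for the prediction type flags."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


word = make_control_word("COARSE_STEP_FLAG", "FRAME_PRED_FLAG", "P_PICT_FLAG",
                         "FORWARD_FLAG", "NEW_CURRENT_MB_FLAG")
print(hex(word))  # 0x12206
```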

[0141] Each predictor is a 16 by 16 two-dimensional array of pixels, which can be located anywhere in the prediction frame. Actually, due to half pixel interpolation, a 17 by 17 array is generally needed. If this 17 by 17 array is mapped onto the block grid, it usually spans 9 blocks (see FIG. 9).

[0142] As the cache is organized in blocks, those 9 blocks need to be accessed. This stage then, taking the output of the MVG, outputs in nine sequential cycles all nine addresses needed to fetch the predictor. As the address space in which the frame buffers reside is assumed to be contained in one 8 Mbytes chunk of memory (consecutive in address and aligned to an 8 Mbytes boundary, so that the most significant address bits will not change), only 23-bit addresses need to be delivered. Of these, the 6 least significant bits will always be ‘0’, as whole blocks are accessed. Therefore, only 17 significant bits must be generated. In some particular cases not all nine blocks, but only 6 or even only 4, need to be fetched. This happens when the absolute coordinates (i.e., current MB position + motion vector) of the predictor are block aligned, i.e., X|Y_half_pixel_coord REM 16 = 0. In this case, the AG will still issue all nine addresses, but it will flag as ‘void’ the ones that do not need loading. This will save bandwidth.

[0143] The output of the block will then be:

[0144] control_word, mv, pred_pos as above

[0145] address(17): VOID_ADDRESS_FLAG

[0146] address(16:0): block_address(22:6)

[0147] These outputs are issued for nine consecutive cycles for each predictor issued by the MVG. The addressing scan order will be from top to bottom and from left to right. The MV coordinates and control word will be propagated to the next stage.
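A minimal software sketch of the address generator behaviour follows. It uses the REM 16 rule from the text to mark void blocks. Several points are assumptions made only for illustration: blocks are laid out in plain raster order within the frame buffer (the text speaks of a tiled format without giving the exact mapping), the nine words are emitted row by row, positions are non-negative, and all function names are hypothetical.

```python
BLOCK_BYTES = 64           # one 8x8 block of 8-bit pixels
VOID_ADDRESS_FLAG = 1 << 17


def block_byte_address(bx, by, blocks_per_row, frame_offset=0):
    """Byte address of block (bx, by); raster block layout is an assumption."""
    return frame_offset + (by * blocks_per_row + bx) * BLOCK_BYTES


def ag_outputs(x_half, y_half, blocks_per_row, frame_offset=0):
    """Return the nine 18-bit words (void flag + 17 address bits) for the
    predictor whose absolute position is (x_half, y_half) in half pixels."""
    bx0, by0 = x_half // 16, y_half // 16   # 16 half pixels = one 8-pixel block
    void_col = x_half % 16 == 0             # third block column not needed
    void_row = y_half % 16 == 0             # third block row not needed
    words = []
    for row in range(3):
        for col in range(3):
            addr = block_byte_address(bx0 + col, by0 + row,
                                      blocks_per_row, frame_offset)
            word = (addr >> 6) & 0x1FFFF    # block_address(22:6)
            if (col == 2 and void_col) or (row == 2 and void_row):
                word |= VOID_ADDRESS_FLAG
            words.append(word)
    return words


def blocks_to_fetch(words):
    return sum(1 for w in words if not w & VOID_ADDRESS_FLAG)
```

With a 720-pixel wide picture (90 blocks per row), `blocks_to_fetch(ag_outputs(16, 16, 90))` gives 4, while positions aligned on only one axis or on neither give 6 and 9 respectively.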

[0148] The PF stage 102 is responsible for physically gathering the 9 blocks in which the predictor to be tested is located. The PF will first look into its block cache for the requested addresses and, in case of misses, will output a request to the main memory via the STBUS port to bring the needed block(s) into the local cache. The PF will be physically composed of a memory, a cache refill engine, and all the logic to handle the inputs from the AG and the outputs to the Predictor Alignment stage.

[0149] The cache is logically organized as a 4-way set associative one, with a total memory capacity of 16 KB. Each cache line will contain one block, i.e., 64 pixels of 8 bits each. It is possible to selectively read all the bytes of the block, or only the ones belonging to one field, be it top or bottom. This can be achieved either by a field_select control bit in the memory or by physically splitting the memory into two sub-arrays, with words of 32 pixels each. Accesses by the PF to the data loaded in the cache will always be reads. Writes to the cache will only happen on refill. Therefore there is no need for any write-back or write-through capability, nor for any invalidation operation. Cache coherence is not a problem either, as the predictor frames will remain constant during the time of motion estimation. Therefore only a very simple cache controller is needed.

[0150] As stated, the cache appears logically as a 4-way one. In a general purpose CPU, this is implemented with a 4-fold split of the physical memory so as to access all 4 ways simultaneously while performing tag lookup at the same time. This would lead to high power consumption, especially taking into account the very wide cache word (512 or 2×256 bits). In the SHE, instead, tag lookup and cache memory access will be performed sequentially in two clock cycles. This leads to a 75% power saving. The address generation and data utilization are not directly in closed loop, so this latency is hidden by the pipeline (see FIGS. 10 and 11). The memory requirement will therefore be 1 single ported memory of 256 words of 512 bits each with a field select control pin, or 2 single ported memories of 256 words of 256 bits each. Stated otherwise, the 4 cache ways are “emulated” by a single memory: the absence of multiple parallel reads from 4 blocks saves 75% of the power, and the delay introduced is negligible for S.H.E. operation. The read stages (req_addr to cache_addr; cache_read) are pipelined, so that one pipelined read per cycle can be effected.

[0151] Emulated ways are stacked one over the other in a single physical internal memory. Concerning the cache memory architecture, at least two solutions can be used.

[0152] A first solution (FIG. 11a) is one memory array with field select pins; this is required because sometimes only half of the data and sometimes all of it is needed, and it could save 50% power when only half the word (i.e., one field) is required. As an alternative (FIG. 11b), if the memory of the first solution is not available or consumes more power than that of the second, two separate memories of 8 KB each are used; the two memories could even share the same address decoder if the power optimization is substantial. No bit/byte enable is needed in this case, as the whole word is always read or written.
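By way of non-limiting illustration, the emulation of four ways in a single 256-word memory with sequential tag lookup can be sketched as follows. The stacking rule `way * SETS + set`, the round-robin replacement policy and the `fetch` callback standing in for the refill engine are assumptions, not taken from the text; the class and its members are hypothetical names.

```python
class EmulatedFourWayCache:
    WAYS, SETS, LINE_BYTES = 4, 64, 64  # 4 * 64 * 64 bytes = 16 KB

    def __init__(self):
        self.tags = [[None] * self.WAYS for _ in range(self.SETS)]
        # Single physical memory of 256 words; the ways are stacked in it.
        self.data = [None] * (self.WAYS * self.SETS)
        self.next_victim = [0] * self.SETS  # round-robin policy (assumption)
        self.data_reads = 0                 # data-array reads actually performed

    def read(self, block_address, fetch):
        """Return (line, hit) for a 17-bit block address."""
        set_index = block_address % self.SETS
        tag = block_address // self.SETS
        # First cycle: tag lookup only, the data array is not touched.
        hit = tag in self.tags[set_index]
        if hit:
            way = self.tags[set_index].index(tag)
        else:
            way = self.next_victim[set_index]
            self.next_victim[set_index] = (way + 1) % self.WAYS
            self.tags[set_index][way] = tag
            # The only writes to the data array happen on refill.
            self.data[way * self.SETS + set_index] = fetch(block_address)
        # Second cycle: one read of the selected way instead of four parallel reads.
        self.data_reads += 1
        return self.data[way * self.SETS + set_index], hit
```

In this sketch every access costs one data-array read, whereas a design reading all four ways in parallel would perform four, which is where the 75% figure in the text comes from.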

[0153] The refill logic will also enforce the “bandwidth cap”: a register held in this block and programmable by the system control CPU will tell how many blocks the stage is allowed to request from the main memory for each macroblock's coarse and fine search respectively. Once this limit is reached, the refill engine will not perform any further refill of the cache, thus not exceeding the allowed peak bandwidth in any macroblock period (see FIG. 12). Of course, in this case the PF will not be able to construct the 9-block region from which to extract the predictor, so this motion vector has to be discarded and not counted among the candidates for the final predictor winner. This is indicated by setting the NULL MV FLAG in the control word. The data used to fill the missing block(s) will of course be “don't care” and implementation dependent, as the predictor will never be considered a valid candidate.
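By way of non-limiting illustration, the bandwidth cap can be modeled as a per-macroblock refill budget. The names `BandwidthCap`, `try_refill` and `fetch_predictor`, and the representation of the search phase as a string, are hypothetical.

```python
class BandwidthCap:
    """Per-macroblock refill budget, separate for coarse and fine search."""

    def __init__(self, coarse_limit, fine_limit):
        # Limits correspond to the register programmed by the control CPU.
        self.limits = {"coarse": coarse_limit, "fine": fine_limit}
        self.used = {"coarse": 0, "fine": 0}

    def new_macroblock(self):
        """Reset the counters at the start of each macroblock period."""
        self.used = {"coarse": 0, "fine": 0}

    def try_refill(self, phase):
        """Return True and count one block if the budget allows a refill."""
        if self.used[phase] >= self.limits[phase]:
            return False
        self.used[phase] += 1
        return True


def fetch_predictor(missing_blocks, cap, phase):
    """Return the NULL MV flag for a predictor that misses on missing_blocks blocks."""
    for _ in range(missing_blocks):
        if not cap.try_refill(phase):
            return True   # a block could not be loaded: discard this motion vector
    return False
```

A predictor whose blocks all hit in the cache never touches the budget, so it remains a valid candidate even after the cap has been reached.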

[0154] If the address to be fetched is flagged as VOID_BLOCK_ADDRESS, the PF stage will not generate any access to the cache and will fill the block with “don't care”, implementation dependent data, as they will not actually be used for the predictor construction.

[0155] A miss will of course cause the whole pipeline to stall for as long as it takes to load the missing block. The stall will be propagated through the normal handshake mechanism between stages, meaning that the delay in outputting the missing block and in consuming the subsequent inputs will cause the other stages to stall for the appropriate time. The addresses issued to the STBUS port will be composed of several portions, generated as follows:

[0156] (31:23): the 8-Mbyte region containing the frames, constant, held in a configuration register

[0157] (22:6): block address, as from AG stage

[0158] (5:0): block scan: these bits will increment according to a fixed pattern to scan the whole block memory.
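By way of non-limiting illustration, the three address portions listed above can be assembled as follows. The function names and the `step` parameter (the number of bytes moved per bus access) are hypothetical, and a simple linear increment is assumed for the scan pattern.

```python
def stbus_address(region, block_address, scan_offset):
    """Compose a 32-bit address from the region (31:23), block (22:6) and scan (5:0) fields."""
    assert 0 <= region < (1 << 9)
    assert 0 <= block_address < (1 << 17)
    assert 0 <= scan_offset < (1 << 6)
    return (region << 23) | (block_address << 6) | scan_offset


def refill_addresses(region, block_address, step):
    """List the addresses of one block refill, assuming a linear scan in steps of `step` bytes."""
    return [stbus_address(region, block_address, off) for off in range(0, 64, step)]
```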

[0159] To simplify the refill engine and to optimize memory accesses, whole blocks will be loaded into the cache, not single fields, even if the miss is caused by a field predictor.

[0160] The refill engine will be able to perform some look-ahead on the addresses requested by the AG stage, in order to try to hide the stall latency. This can be achieved by decoupling the tag lookup task from the actual cache memory access with an intermediate buffer, with a view to finding the next miss well in advance and pre-loading the block from memory. In fact, at the first miss, the cache memory access will stall, but tag lookup can continue to determine the next miss, taking into account the tag configuration after that refill. As the miss rate is in the order of 2%, there is a fair chance that the next miss will be well away from the current one. In fact, if it is 10 or more addresses later, up to 10 cycles of the next miss can be hidden, provided there is a 10-location buffer between tag lookup and cache memory access. This buffer will have to hold the cache memory line that each address generated by the AG will hit, up to the next miss or until the buffer is full.
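By way of non-limiting illustration, the benefit of the look-ahead buffer can be estimated with a first-order model. The model assumes one address consumed per cycle and a refill of the next miss that starts as soon as tag lookup discovers it; the function name and its parameters are hypothetical.

```python
def visible_stall_cycles(miss_latency, gap_to_next_miss, buffer_depth):
    """Estimate the stall cycles of the next miss that the pipeline still sees.

    gap_to_next_miss is the number of hitting addresses before the next miss.
    Tag lookup can run ahead by at most buffer_depth addresses, so at most
    min(gap, depth) cycles of the refill overlap with useful cache reads.
    """
    hidden = min(gap_to_next_miss, buffer_depth, miss_latency)
    return miss_latency - hidden
```

For example, with a 10-location buffer and the next miss 10 addresses away, this model hides 10 cycles of a 20-cycle refill.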

[0161] The output of this stage to the Predictor Alignment (PA) block 103 will therefore be, in 9 consecutive cycles, the 9 blocks in which the actual predictor is found. In case the predictor is a frame predictor, the whole 64 bytes of each block will be output. In case it is a field predictor, only the relevant field of each block will be accessed in cache and output to the PA, to save power.

[0162] control_word, mv, pred_pos as above

[0163] pixels(511:0): one prediction block (frame prediction)

[0164] pixels(255:0): one prediction block (field prediction)

[0165] pixels(511:256): “don't care” (field prediction)

[0166] The predictor alignment (PA) 103 will take the data of the 9-block area in which the actual predictor resides and extract the predictor with all the relevant operations: the actual extraction of the 17 by 17 array (general case, with half pixel interpolation), horizontal and/or vertical half pixel interpolation, and bi-directional/dual prime prediction interpolation. This is achieved by reformatting the block-based output of the PF into a lines-of-macroblock output and by selecting the 17 by 17 array out of the original 24 by 24 one.

[0167] The reformatting is done through a buffer between the PF and PA stages. This will in principle be a 24 by 24 pixel buffer, filled by the PF and read by the PA.

[0168] To extract the needed 17 by 17 array from the 24 by 24 array corresponding to the 9 blocks incoming from the PF, the 17 appropriate rows must be selected out of the 24 given; this is done by simply not selecting the 7 rows that are not needed. To extract the 17 pixels of each row, a simple shifter is used, controlled by the least significant bits of the X absolute coordinate of the predictor.
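By way of non-limiting illustration, the row selection and the shift can be sketched as follows, assuming that the 24 by 24 array starts at the top-left corner of the first of the nine blocks and that coordinates are given in half-pixel units. The function name `extract_17x17` is hypothetical.

```python
def extract_17x17(region, x_half, y_half):
    """Select the 17x17 predictor area out of a 24x24 region of 3x3 blocks."""
    assert len(region) == 24 and all(len(row) == 24 for row in region)
    # Integer-pixel position inside the first 8x8 block (least significant bits).
    x_off = (x_half >> 1) % 8
    y_off = (y_half >> 1) % 8
    # Row selection drops the 7 unused rows; the slice plays the role of the shifter.
    return [row[x_off:x_off + 17] for row in region[y_off:y_off + 17]]
```

Since the offsets never exceed 7, the selected window always fits inside the 24 by 24 array.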

[0169] Half pixel interpolation will be performed on-the-fly by 8-bit adders and a 9-bit increment, discarding the lsb's as appropriate during processing to return to 8-bit accuracy. Further details are shown in FIG. 13.

[0170] This arrangement will save some of the adders needed for half pixel interpolation: a “conventional” implementation can be envisaged using 3 adders plus one increment per pixel, while here 2 adders plus an increment are used. One pixel latch register will also be saved, as the result of the horizontal interpolation of the line above (needed for vertical interpolation) will be stored instead of the two original pixels.
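By way of non-limiting illustration, the following sketch compares a direct per-pixel interpolation with a line-by-line version that keeps the horizontal result of the line above. It assumes the usual rounded averages, (a+b+1)>>1 for one direction and (a+b+c+d+2)>>2 for both, and assumes that the stored horizontal result is the unrounded sum so that both versions agree bit for bit; FIG. 13 may differ. Both function names are hypothetical.

```python
def interpolate_reference(p, hor, ver):
    """Direct form: up to 3 additions plus a rounding increment per output pixel."""
    out = []
    for y in range(16):
        line = []
        for x in range(16):
            a, b, c, d = p[y][x], p[y][x + 1], p[y + 1][x], p[y + 1][x + 1]
            if hor and ver:
                line.append((a + b + c + d + 2) >> 2)
            elif hor:
                line.append((a + b + 1) >> 1)
            elif ver:
                line.append((a + c + 1) >> 1)
            else:
                line.append(a)
        out.append(line)
    return out


def interpolate_streaming(p, hor, ver):
    """Line-by-line form: one horizontal add, one vertical add, one increment."""
    out, above = [], None
    for y in range(17 if ver else 16):
        # Adder 1: horizontal sum of the incoming line, kept unrounded (assumption).
        h = [p[y][x] + p[y][x + 1] if hor else p[y][x] for x in range(16)]
        if not ver:
            out.append([(s + 1) >> 1 if hor else s for s in h])
        elif above is not None:
            # Adder 2 plus increment: combine with the stored result of the line above.
            if hor:
                out.append([(u + s + 2) >> 2 for u, s in zip(above, h)])
            else:
                out.append([(u + s + 1) >> 1 for u, s in zip(above, h)])
        above = h  # stored instead of the two original pixels
    return out
```

In the streaming form the first output line appears one line late when vertical interpolation is enabled, since a stored line above is needed before any output can be produced.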

[0171] ver_half_pel and hor_half_pel indicate whether half pixel interpolation is needed; these signals stay constant for the whole predictor.

[0172] A temporary buffer of 16 by 16 pixels is also needed to perform predictor interpolation for bidirectional and dual-prime prediction. In this case, the first predictor is stored, to be then interpolated on-the-fly when the second component becomes available. For this purpose, a third set of interpolators is needed. Additional details are shown in FIG. 13.

[0173] The output will be a single line of 16 pixels per clock cycle. This output will last for 16 cycles in case of frame mode matching, or 8 cycles for field mode. A further flag signaling the last line of the current matching will be output in order to allow the distengine to stop the accumulation of the MAE and output it to the decision block.

[0174] control_word, mv, pred_pos as above

[0175] last_line active when last line of the predictor is output

[0176] pred_pixel(127:0) the predictor's pixels to test

[0177] The stage designated 104 (i.e., the CMB Fetch and Distengine, briefly CFD) is responsible for computing the actual MAE of the selected MV. As the Current MacroBlock (CMB) is not used by any of the preceding stages, it is fetched from memory. The fetch will happen prior to CMB usage in order to hide the load latency. So, while CMB n is being processed, CMB n+1 will be fetched when the STBUS port is not used to load predictor blocks. In order to do this, a temporary buffer of 256 pixels is needed, in addition to the 256 pixels needed for the CMB under estimation.

[0178] The P CMB feed through described in the foregoing is implemented here with a simple macroblock buffer that holds the coarse search macroblock, optionally post-processed by the MCNR. There is therefore a requirement for the MCNR to be able to complete its filtering within a macroblock period. The MCNR will start processing the macroblock as soon as the coarse search finishes, and ideally should finish before the end of the current macroblock period. Because coarse search is far less complex than fine search, it is fair to assume that it will take less time than fine search and will therefore complete before ½ the macroblock period has elapsed. The MCNR must then complete its processing before the end of the period, having at least ½ macroblock period in which to do so. It will overwrite the CMB in memory, and also the copy in the feed through buffer, so that fine search will use it correctly. In case the delay between coarse and fine search is greater than one MB period, fine search will reload the correct CMB directly from memory, once again assuring correct operation.

[0179] The total buffering sums up to 256*3=768 bytes. While processing the CMB, one macroblock line (16 pixels=128 bits) is accessed per cycle. Therefore, these 3 macroblock buffers can be implemented by a single single-ported memory with 48 words of 128 bits each. In this case, while the next CMB is being fetched and written to this memory, the distengine will not be able to process. But as this stall can be limited to 16 cycles, it is not expected to be a major problem. The alternative implementation would require 256*3*8=6144 flip-flops.

[0180] As far as the distengine implementation is concerned, the microarchitecture is as shown in FIG. 14.

[0181] In order to speed up the decision function block task, the Distengine will also compute the mean of the prediction error and of the current macroblock. The Distengine will be programmable (via control word bits) for field or frame matching. In the first case the predictor/current macroblock will contain 8 lines; in the second, it will contain 16 lines. Another issue arises for compatibility with MPEG-4 and H263 block (vs. macroblock) matching. For example, the H263 standard allows 8×8 pixel frame mode prediction. To allow multi-standard capability, the SHE should therefore support this 8×8 mode as well. This could be implemented by adding a flag in the control word to signal that the 8×8 prediction mode is enabled. In a first implementation, the stages before the distengine could continue to fetch the standard 17 by 17 area. When the predictor/current macroblock is fed to the distengine, it will gather the result from the 8 by 8 area only. A second, more efficient implementation would be to make the AG, PF and PA stages sensitive to the flag as well. This would marginally increase logic complexity, but would reduce data movement, with beneficial effects on power consumption.

[0182] control_word, mv, pred_pos as above

[0183] mae(15:0) mae value for this matching; unsigned integer quantity

[0184] pred_err_sum(16:0) sum of the pixel by pixel prediction error, modulo-2 signed integer quantity

[0185] cmb_sum(15:0) sum of all the cmb pixels; unsigned integer quantity; this can be computed only once per estimation and then gated out for power consumption issues
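By way of non-limiting illustration, the three accumulated outputs can be sketched as follows. The sketch assumes that mae is the accumulated absolute error without any division, which is consistent with its 16-bit width, and that the prediction error is taken as current minus predictor; the sign convention is not stated in the text. The function name is hypothetical.

```python
def distengine(cmb_lines, pred_lines):
    """Accumulate mae, pred_err_sum and cmb_sum over 8 (field) or 16 (frame) lines."""
    assert len(cmb_lines) == len(pred_lines) and len(cmb_lines) in (8, 16)
    mae = pred_err_sum = cmb_sum = 0
    for cur, pred in zip(cmb_lines, pred_lines):   # one 16-pixel line per cycle
        assert len(cur) == 16 and len(pred) == 16
        for c, p in zip(cur, pred):
            diff = c - p            # assumed sign convention: current minus predictor
            mae += abs(diff)
            pred_err_sum += diff
            cmb_sum += c
    # The worst case 256 * 255 = 65280 fits the widths given for the outputs.
    assert mae < (1 << 16) and cmb_sum < (1 << 16)
    assert -(1 << 16) <= pred_err_sum < (1 << 16)
    return mae, pred_err_sum, cmb_sum
```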

[0186] The decision stage 105 is actually split into two sub-functions: one to gather all the partial results of the current block estimation, the other to compute the macroblock coding decision functions on the motion estimation winner. To be able to compute the coding decision functions, the data of the current macroblock under estimation and its best predictor, plus the no_mc predictor for P pictures, are needed. Therefore, a RAM will be needed in order to store the winner for each prediction mode. This leads to the following memory requirements:

For P pictures:
Current macroblock: 256 bytes
Frame mode predictor winner: 256 bytes
No_mc predictor: 256 bytes
Field/dual_prime top winner: 128 bytes
Field/dual_prime bottom winner: 128 bytes
Dual_prime temp buffer: 128 bytes
Temp buffer (incoming predictor): 256 bytes
Total: 1408 bytes

For B pictures:
Current macroblock: 256 bytes
Frame mode predictor winner: 256 bytes
Field/dual_prime top forward winner: 128 bytes
Field/dual_prime bottom forward winner: 128 bytes
Field/dual_prime top backward winner: 128 bytes
Field/dual_prime bottom backward winner: 128 bytes
Temp buffer (incoming predictor): 256 bytes
Total: 1280 bytes

[0187] I pictures will just need Current MB for DCT type decision.

[0188] Additional information that needs to be stored is the motion vector (32 bits) and MAE value (16 bits) for each of the mode winners and for the current predictor.

[0189] When a new MAE arrives, it will be compared with the current winner for the mode to which the predictor belongs and, if it is less than or equal to it, the new predictor will replace the current winner. The memory will actually be organized as circular buffers, so that each mode winner can be located in a different part of the memory, in order not to physically move data when a mode winner is updated. This will require a few additional storage bits for each mode winner, to point to the position in memory where the predictor resides. Because each predictor is 128 or 256 bytes, one just needs to identify which of the 128-byte regions are used by each predictor; because there are 12 of these regions in 1.5 KB of memory, only 4 bits are needed for this purpose. To make sure that memory fragmentation is avoided, new field mode predictors will be saved in the uppermost free part of the memory, while new frame mode predictors will occupy the lowest available part of the memory.
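By way of non-limiting illustration, the winner update rule and the pointer width can be sketched as follows. The class `ModeWinners`, its mode keys and the tuple layout are hypothetical, and the placement of predictors in the 128-byte regions is left out of the sketch.

```python
class ModeWinners:
    REGION_BYTES = 128
    NUM_REGIONS = 12                                 # 12 * 128 bytes = 1.5 KB
    POINTER_BITS = (NUM_REGIONS - 1).bit_length()    # bits needed to name a region

    def __init__(self):
        self.winners = {}   # mode -> (mae, mv)

    def update(self, mode, mae, mv):
        """Replace the winner of `mode` when the new MAE is less than or equal."""
        current = self.winners.get(mode)
        if current is None or mae <= current[0]:
            self.winners[mode] = (mae, mv)
            return True
        return False
```

Note that with the less-than-or-equal rule a tie is resolved in favor of the most recently tested motion vector.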

[0190] The second task to be performed is the decision of the macroblock coding type. For this purpose the current macroblock, the prediction winner and the no_mc winner for P pictures are needed. The functions to be computed are intra_macroblock SMA, inter_macroblock SMA, no_mc SMA, and then DCT field_difficulty and frame_difficulty.

[0191] This task is done either sequentially or in parallel with motion estimation. In the first case the issue of motion vectors will be stopped to allow the mode winners memory to be accessed by the decision functions logic. Alternatively, a double banked predictor memory can be used, which requires doubling the predictor winners memory, adding 1.5 KB of memory. It would then be possible to swap banks between the gathering of motion estimation partial results and the coding decision task.

[0192] Once all the decisions have been taken, the current MB, its computed MV with the final luma predictor, and the prediction error are available. These results can be sent via DMA into a “prediction error frame buffer” in memory, ready to be used by the loop encoder. The associated MV and the coding decision taken can be put in appropriate data structures in memory. In addition, an extra function of chroma prediction gathering could be inserted in the engine.

[0193] The engine will also have to feed back the coarse and fine MV winners to the MVG MV FIFO, for it to be able to generate vectors recursively.

[0194] Finally, the flow of a motion vector to be tested through the pipeline, as depicted in FIG. 15, will be described in detail. It must be understood that between each pair of blocks there will be buffering means able to decouple the operations of the stages to a certain degree. These buffers will work as FIFOs with overflow/underflow control, so that no data is lost when a buffer is full and no data is output when a buffer is empty. This will be done through a handshake of each stage's input and output with the buffers. A stage will stall if its output buffer is full and/or its input buffer is empty. This allows events like cache misses, MVG delays, and so on to be handled correctly. The situation depicted in FIG. 15 assumes that all these buffers are empty at the moment when the MV in the example arrives. To limit power consumption, it is recommended that when a stage is stalling due to buffer unavailability its clock does not tick, i.e., the clock is gated by the input_buffer_empty/output_buffer_full signals.
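By way of non-limiting illustration, the stall rule of a stage placed between two FIFOs can be sketched as follows. The names `Fifo` and `step_stage`, and the `work` callback standing in for the processing of a stage, are hypothetical; each call models one clock cycle.

```python
from collections import deque


class Fifo:
    """Bounded buffer between two pipeline stages."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    @property
    def empty(self):
        return not self.items

    @property
    def full(self):
        return len(self.items) >= self.depth


def step_stage(inp, out, work):
    """Advance one stage by one clock; return False if the stage stalled."""
    if inp.empty or out.full:
        return False   # stall: this is the condition that would gate the clock
    out.items.append(work(inp.items.popleft()))
    return True
```

A stall in one stage fills the buffer in front of it and drains the one behind it, so the neighbouring stages stall in turn without any data being lost.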

[0195] As soon as a motion vector is issued by the MVG, it will go to the address generation input buffer. The size of this buffer is characterized in terms of latencies. The address generator will then pick up the vector and issue, in nine consecutive cycles, the 9 addresses needed to extract the predictor. Some of these might be flagged as “void”, as the predictor will not actually contain pixels from that block, but in any case the processing will still take 9 cycles. Addresses flagged as void contain “don't care”, implementation dependent data.

[0196] Those addresses will go to the fetch input buffer. It is recommended that at least 8-10 positions be available in this buffer, to perform efficient miss look-ahead as previously described. Once in the fetch stage, the addresses will be compared with the cache content and, if no miss happens, the blocks are output in nine consecutive cycles. In case of a miss, the output of the block that generated the miss will be delayed by the time taken to load the data from main memory. This in turn will make all the previous and subsequent stages stall due to buffers being full or empty, allowing correct handling of the miss stall.

[0197] The 9 blocks will then be output directly to the PA stage. In order to be able to extract prediction lines out of the blocks, a ‘block to lines’ buffer of at least 3 blocks, or 6 blocks for a more efficient implementation, is needed. A 3-block buffer will in fact add a 3-cycle latency every time it needs to be refilled once it has delivered its initial content. This can be hidden with a 6-block buffer, so that the next 3 blocks of data can be received while the lines of the first 3 blocks are delivered. With this buffer arrangement, delivery of one line of predictor per cycle (apart from the first cycle delay in case of vertical half pixel interpolation) can be sustained by the PA.

[0198] As soon as it has the first line of the predictor available, the PA will start to output the predictor to the distengine in 8 (field mode) or 16 (frame mode) subsequent clock cycles. The suggested microarchitecture of the PA block will use one initial delay cycle prior to outputting the predictor in case of vertical half pixel interpolation, and no delay when vertical half pixel interpolation is not used. No buffering is needed between the PA and the CFD, and the transfer will be based on a simple handshake mechanism. The distengine will output the MAE result, which can be taken without any buffering by the decision block.

[0199] From the foregoing it will be appreciated that, although specific embodiments of the invention have been described herein for purposes of illustration, various modifications may be made without deviating from the spirit and scope of the invention. Accordingly, the invention is not limited except as by the appended claims.

1. A coprocessor circuit for processing image data in digital form, including: a motion vector controller block for generating, starting from said image data, motion vector values including predictor data and macroblock data relating to a current macroblock of said image data to be estimated, said prediction data and macroblock data being adapted to be stored at respective memory addresses, an address generator block for extracting said respective addresses from said motion vector values, a predictor fetch block for retrieving said predictor data based on respective addresses extracted by said address generator block, a current macroblock fetch and distengine block for retrieving said macroblock data based on respective addresses extracted by said address generator block and for processing said macroblock data according to a given function, and a decision block for collecting said retrieved data as partial results and selecting the best result therefrom.
2. The circuit according to claim 1 wherein said motion vector controller block is implemented as a DSP.
3. The circuit according to claim 1 wherein said motion vector controller block is arranged to run a microcode.
4. The circuit according to claim 3 wherein said motion vector controller block has associated therewith a memory, preferably of the flash type, for storing said microcode.
5. The circuit according to claim 1 wherein said circuit is arranged to perform two distinct estimation steps, namely a coarse search and a fine search, respectively, of said image data, said estimation steps being carried out in parallel on different macroblocks.
6. The circuit according to claim 5 wherein the circuit includes time-sharing hardware resources to generate in parallel the result of the coarse search for a macroblock and the result of the fine search for another macroblock.
7. The circuit according to claim 1 wherein the circuit includes temporal noise reduction means attached at the output of the decision block to noise-reduce said image data.
8. The circuit according to claim 5 wherein said noise reduction means perform motion compensated noise level detection and reduction based on the motion vectors resulting from the coarse search, preferably by using as inputs the coarse search current macroblock and its predictor block.
9. The circuit according to claim 8 wherein said noise reduction means output a noise-reduced version of the current macroblock that will overwrite the noise corrupted one.
10. The circuit according to claim 1 wherein said motion vector controller block is arranged to perform at least one ancillary function selected from the group consisting of scene change detection, inverse 3/2 pull down, interlace/progressive content detection, f code adaptation.
11. The circuit according to claim 1 wherein said motion vector controller block is arranged to perform at least one function selected from the group consisting of counting the cycles spent to estimate the current macroblock, inserting stall or power down cycles or additional motion vector tests to ensure synchronization with input data.
12. The circuit according to claim 1 wherein said motion vector controller block includes a local memory adapted to receive slices of said motion vectors.
13. The circuit according to claim 12 wherein said motion vector controller block has associated therewith slice FIFOs of a first type containing motion vector data resulting from previous estimation of the macroblock in the same frame and of a second type containing results from estimations of macroblocks in previous pictures or previous passes of prediction.
14. The circuit according to claim 1 wherein said address generator block is arranged to output the addresses required to fetch said predictor data in sequential cycles.
15. The circuit according to claim 1 wherein said address generator block is arranged to issue as voids at least some of said addresses not requiring loading when the absolute coordinates of the predictors are block aligned.
16. The circuit according to claim 1 wherein said predictor fetch block has associated therewith an internal memory managed as a cache memory.
17. The circuit according to claim 16 wherein said predictor fetch block loads the search windows pixels of said image data selectively and/or buffers them in said internal memory by dynamic allocation.
18. The circuit according to claim 14 wherein said predictor fetch block has a bus access limiter coupled to the cache refill engine.
19. The circuit according to claim 18 wherein said bus access limiter is arranged for clipping high-bandwidth peaks.
20. The circuit according to claim 18 wherein said bus access limiter acts at a macroblock by macroblock level.
21. The circuit according to claim 18 wherein said bus access limiter has a selectively variable maximum allowed bandwidth value.
22. The circuit according to claim 16 wherein said cache memory is organized as a multiway, preferably as a 4-way set associative memory.
23. The circuit according to claim 22 wherein said predictor fetch block is arranged to permit selective reading of blocks in each line of said cache memory, thereby permitting all the bytes of each block or only the blocks belonging to one field to be selectively read.
24. The circuit according to claim 16 wherein said cache memory is arranged in order to permit writing of data therein only when refilling the respective refill engine.
25. The circuit according to claim 16 wherein within said cache memory tag lookup and access operations are performed sequentially in subsequent clock cycles.
26. The circuit according to claim 16 wherein said cache memory is physically composed of a single piece instead of N, where N is the number of ways in which said cache is logically organized.
27. The circuit according to claim 16 wherein the circuit includes an intermediate buffer to decouple the tag lookup task from memory access in said cache memory.
28. The circuit according to claim 16 wherein said cache memory is arranged, preferably at the refill engine level, to find in advance the next miss and proceed to pre-load the block from memory.
29. The circuit according to claim 27 wherein, at the first miss, the cache memory access stalls, but tag lookup continues to determine the next miss, preferably by taking care of the tags configuration after that refill.
30. The circuit according to claim 1 wherein said predictor fetch block has associated therewith a predictor alignment block to reformat a block-based output of said predictor fetch block into a lines-of-macroblock output and selecting a sub-array out of the original array or the output of said predictor fetch block.
31. The circuit according to claim 30 wherein said predictor alignment block includes a respective buffer filled by said predictor fetch block.
32. The circuit according to claim 30 wherein said predictor alignment block is arranged to perform interpolation of the data transferred from said predictor fetch block towards said fetch and distengine block.
33. The circuit according to claim 1 wherein said fetch and distengine block applies, as said given function, the mean absolute error over a given macroblock of the sum of absolute differences produced by pixel comparison.
34. The circuit according to claim 23 wherein said fetch and distengine block is arranged as a monodimensional array of computing elements.
35. The circuit according to claim 33 wherein said monodimensional array is a monodimensional array of SAD elements.
36. The circuit according to claim 5 wherein said fetch and distengine block includes a macroblock buffer to store coarse search macroblocks in order to permit processing each macroblock as soon as the coarse search finishes.
37. The circuit according to claim 36 wherein said macroblock buffer is implemented as single ported memory.
38. The circuit according to claim 1 wherein said fetch and distengine block includes a programmable distengine module for field or frame matching.
39. The circuit according to claim 1 wherein said decision block includes a first module to gather the partial result of current block estimation and a second module to compute the macroblock coding decision functions on the motion estimation winner.
40. The circuit according to claim 39 wherein the circuit includes a decision memory, preferably a RAM, to store the winner for each prediction mode.
41. The circuit according to claim 1 wherein the decision block is arranged to compare new data obtained by applying said given function with a current winner for the mode to which the predictor belongs and if the current winner is less than or equal the new data, the new data will replace the current winner.
42. The circuit according to claim 1 wherein said decision block performs decision of the macroblock coding type sequentially or in parallel with respect to motion estimation.
43. The circuit according to claim 42 wherein said decision of the macroblock coding type is performed sequentially with respect to motion estimation and in that the issue of motion vectors is stopped to allow the mode winners memory to be accessed.
44. The circuit according to claim 1 wherein the circuit is formed on a monolithic integrated circuit substrate.
45. A method for processing an image data in digital form comprising: generating motion vector values including predictor data and macroblock data from input image data to be estimated; extracting respective addresses from said motion vector values; retrieving said predictor data based on respective addresses extracted from the motion vector values; retrieving said macroblock data based on respective addresses extracted from said motion vector values; and collecting said retrieved macroblock data as partial results and selecting from said partial results a preferred data set.
46. The method according to claim 45 wherein said generating step includes: performing a coarse search on said image data to perform a first estimation step; and performing a fine search on the same image data to perform a second estimation step.
47. The method according to claim 46, further comprising: performing motion compensated noise level detection; and reducing the noise level based on the motion vectors resulting from the coarse search.
48. The method according to claim 45, further including: outputting the addresses required to fetch said predictor data in sequential cycles.
49. The method according to claim 14, further including: issuing as voids at least some of the addresses not requiring loading when the absolute coordinates of the predictor block are aligned.
50. The method according to claim 45, further including: continuing to perform tag lookups to determine the next miss when the cache memory access stalls.